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Abstract 

Background: The interest of the scientific community in investigating the impact of rare variants on complex traits 
has stimulated the development of novel statistical methodologies for association studies. The fact that many of 
the recently proposed methods for association studies suffer from low power to identify a genetic association 
motivates the incorporation of prior knowledge into statistical tests. 

Results: In this article we propose a methodology to incorporate prior information into the region-based score test. 
Within our framework prior information is used to partition variants within a region into several groups, following 
which asymptotically independent group statistics are constructed and then combined into a global test statistic. 
Under the null hypothesis the distribution of our test statistic has lower degrees of freedom compared with those 
of the region-based score statistic. Theoretical power comparison, population genetics simulations and results from 
analysis of the GAW17 sequencing data set suggest that under some scenarios our method may perform as well as 
or outperform the score test and other competing methods. 

Conclusions: An approach which uses prior information to improve the power of the region-based score test is 
proposed. Theoretical power comparison, population genetics simulations and the results of GAW17 data analysis 
showed that for some scenarios power of our method is on the level with or higher than those of the score test 
and other methods. 
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Background 

In spite of the success of genome-wide association stud- 
ies (GWAS) in identif)^ing hundreds of common single 
nucleotide polymorphisms (SNPs) associated with dis- 
eases and complex traits (http://www.genome.gov/gwa 
studies/), in many cases the proportion of heritability ex- 
plained by these discovered SNPs is low [1-3]. One of 
the potential explanations of this observation is that rare 
variants (usually defined as those with minor allele fre- 
quency below 1%), which are absent from conventional 
GWAS studies, are responsible for the missing heritabil- 
ity. Indeed, there is strong evidence that rare variants 
are associated with some complex traits [4-8]. 
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When performing rare variants association analysis re- 
searchers face significant methodological challenges. The 
single SNP approach, which is popular in GWA studies, 
is underpowered when applied in rare variants associ- 
ation studies with moderate sample size due to low allele 
count for each individual rare variant. To overcome this 
problem, testing multiple rare variants within a region 
has been recommended [9,10], and a number of region- 
based rare variants tests have been proposed [11-15]. 
However, these novel methodologies may still be under- 
powered for association studies with moderate sample 
size [16]. This motivates the development of statistical 
methods that utilize prior information with the purpose 
of improving power. Currently, much biological informa- 
tion is publicly available, such as the prediction of degree 
of deleteriousness of non-synonymous variants (PolyPhen 
http://genetics.bwh.harvard.edu/pph2/, SIFT http://sift.bii. 
a-star.edu.sg/), SNP prioritization (FastSNP http://fastsnp. 
ibms.sinica.edu.tw), functional SNP annotation (SNPnexus 
http://www.snp-nexus.org/) etc. Although several methods 
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that use prior information have been proposed [17-20], fur- 
ther research is needed to utilize prior knowledge more effi- 
ciently [21] and to expand statistical tools available for 
researchers. 

In this article we propose a method that incorporates 
prior information into the region-based score test. Within 
our framework, prior information is used to partition 
SNPs within a genomic region of interest into groups. 
Then within each group asymptotically independent SNP 
scores are combined into a one degree of freedom (d.f.) 
chi-squared statistics. These group statistics are then used 
to construct a global test for a region. The proposed meth- 
odology has several distinct advantages. First of all, the de- 
grees of freedom of our test statistic equals to the number 
of groups, which may be much less than the number of 
SNPs within a region. Secondly, under partitioning that 
approximately separates associated variants from neutral 
ones ("informative" partitioning) the proposed method ef- 
ficiently handles noise introduced by neutral variants. We 



have evaluated the performance of our method on theor- 
etical power comparison, population genetics simulations 
and analysis of the GAW17 (http://www.gaworkshop.org/ 
gawl7/) real sequencing data set. The results showed that 
under some scenarios the proposed methodology per- 
formed as well as or outperformed the score test and 
other competing methods. 

Results 

Theoretical power comparison 

Figure 1 shows the difference between the theoretical 
power of the proposed method and those of the score 
test for different scenarios at the fixed type-1 error rate 
of 0.05. For Panel 1 of Figure 1, the number of SNPs 
within a region L=10, the number of groups in a parti- 
tioning /<r=2, the range of values for the non-centrality 
parameter (NCP) r 0-30 (x-axis), and the number of 
variants in a causal group Li =2,4,6,8 (different values of 
Li correspond to different curves). As can be seen, the 





Figure 1 The difference in theoretical power (vertical axis) of the proposed test and the score test as a function of the total non- 
centrality parameter r (horizontal axis) at the type-1 error rate a = 0.05. Each curve corresponds to the number of SNPs in the single causal 
group /-I given in the legend (Panels 1 and 2), the number of groups K (Panel 3), and the number of causal groups m (Panel 4). The parameters 
for each of the Panels are as follows: Panel 1: /. = 10, /(= 2; Panel 2: /.= 100, /(= 10; Panel 3: L = 50, =5; Panel 4: L = 54, K= 6, equal number of 
SNPs in each group, and equal non-centrality parameters in all causal groups. 
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power gain of the proposed method reached as high as 
15 percent when the number of SNPs in the causal 
group Li was not large. However, the power of the pro- 
posed test was lower than those of the score test when 
Li was large. This can be explained from the formula 
(3), which implies that with an increase in the number 
of variants in a causal group Li and fixed NCP r the 
value of the group statistic Hi monotonically decreases. 
Also, it is noticeable that with an increase in NCP r up to 
some point the power difference also increased when Li 
was small, whereas the power difference decreased for large 
values of Li. For very large values of NCP r the power dif- 
ference became close to 0 as theoretical powers of both 
tests tended to 1. Similar conclusions can be obtained from 
Panel 2 of Figure 1 which shows the results for the scenario 
with the following parameters: L=100, K=10, varying NCP 
r (x-axis), and varying Li= 10,20,30,40. 

To investigate the impact of the number of groups K 
on power the following model was considered (Panel 3 
of Figure 1): 1=50, Ii=5, and /C=8,16,24,32. As can be 
seen, the power difference is below zero for large values 
of K and above zero for small values of K, This may be 
explained by the fact that the test statistic Ti (4) is dis- 
tributed as chi-squared random variable with K d.f. 
Thus, for the fixed NCP r and increasing value of K the 
power of the proposed method monotonically decreases. 

Alongside the models considered above, it is important 
to include a scenario when causal variants are split be- 
tween several groups. Panel 4 of Figure 1 depicts the 
power difference under the following model: L = 54, K = 
6, and the number of causal groups m = 2,3,4,5. Also, 
for simplicity of presentation it was assumed that each 
group contained equal number of variants (54/6 = 9), 
and NCP r was split equally between all causal groups. 
For this scenario the calculation of theoretical distribu- 
tion of Ti (4) was done using 500,000 simulations of test 
statistics under the alternative hypothesis, as equation 
(6) cannot be used here. As can be seen from Panel 4 of 
Figure 1, our methodology could gain power if associ- 
ated SNPs were split between several groups. When the 
number of causal groups became large, our method was 
slightly worse than the score test. It should be noted that 
for all models the maximum power gain of our method 
(and equivalently the maximum power loss) was achieved 
when the power of the score test was around 50%. Conclu- 
sions similar to those above can be derived from Additional 
file 1 which shows the same scenarios for the fixed type-1 
error rate set at the genome-wide level assuming 35,000 
genes {a = 0.05/35000). 

Population genetics simulations results 

Using population genetics simulations we compared our 
method with the score test and other proposed ap- 
proaches, namely, weighted selective collapsing strategy 



(WSCS) [22], variable threshold (VT) [18], weighted 
sum of squared scores test (SSUw) [23], and optimal se- 
quence kernel association test (SKAT-O) with linear 
kernel and beta weights [24]. Since these tests utilize 
only genotype and phenotype information, we applied 
our method with MAF (minor allele frequency) parti- 
tioning: variants within a region are divided into two 
groups, namely, those with observed MAF above and 
below 1%. To estimate the empirical type-1 error we ran 
the analysis of simulated data under the null model of 
no association (Additional file 2). As can be seen from 
Additional file 2, the type-1 error was well controlled 
for all the tests. 

Figure 2 shows the results of the population genetics 
simulations. As can be seen, our method was the most 
powerful for both "Low Frequency" and "Common" 
phenotype models, closely followed by WSCS and SSUw 
respectively. For the "Interaction" model WSCS achieved 
slightly higher power than other methods. For the "Rare" 
model WSCS and VT tests showed the highest power, 
whereas our method performed worse than most other 
methods. Although in this scenario all the causal vari- 
ants had MAF < 1% in a haplotype pool, some of the 
causal variants were expected to have MAF > 1% in a 
data replicate, since causal alleles increased the probabil- 
ity of a disease. Thus, in this case applying MAF parti- 
tioning could result in causal variants with high MAF 
being combined with neutral common SNPs. 

GAW17 analysis results 

In addition to theoretical power comparison and popula- 
tion genetics simulations we also used the GAW17 data 
set to compare our test with other methods. The 
GAW17 data set was designed to mimic a real exome 
sequencing study of a complex disease. Full description 
of the method used to generate the GAW17 data can be 
found in Almasy et al. [25]. Briefly, the whole-exome 
sequencing data from 1000 Genomes Project (http:// 
www.1000genomes.org) was the basis for simulation 
of 200 replicates of dichotomous phenotype and 200 
replicates of 3 quantitative traits (endophenotypes) in 
697 unrelated individuals from six populations (Chinese, 
Japanese, Yoruba, Luhya, Utah residents with Northen 
and Western European ancestry - CEPH, and Tuscan). 
Causal variants were chosen to be common and rare 
non-synonymous SNPs concentrated in genes from specific 
pathways. Also, dichotomous phenotype and quantitative 
traits were impacted by covariates such as smoking status, 
gender and age. 

We conducted an association analysis of known causal 
genes with two quantitative traits (third quantitative 
trait was simulated independently from genotype) and a 
dichotomous phenotype adjusting for covariates and 
population stratification. The adjustment procedure was 
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Figure 2 Comparison of thie proposed methiod withi MAF partitioning and otiier statistical tests on population genetics simulations. 

In the "Rare" phenotype model only rare variants (MAF<1% in haplotype pool) were causal with uniform effect size. "Low Frequency" and 
"Common" phenotype models had only one low frequency (MAF between 1% and 5%) and one common (MAF>5%) causal SNP respectively. 
Finally, the "Interaction" scenario models the interaction of rare variants with a common SNP. A minor allele of a rare causal variant had an 
impact on phenotype if and only if it was present on the same haplotype as a minor allele of a common SNP chosen beforehand. 



similar to those described by Jiang and Dong [26]. Let G 
be the genotype matrix of a gene under investigation; 
Qi,Q2,D - vectors of two quantitative traits and a di- 
chotomous phenotype respectively; = 1,2,3 - vectors 
of age, gender and smoking status respectively; R - 
matrix of ten principal components obtained from 
Eigenstrat [27]. First, phenotypes, genotype and co- 
variates were adjusted for population stratification 
as follows: adjusted genotype G = G-RR^G; adjusted 
phenotypes Qi = Qi-RR^Q,,Q2 = Q2-RR^ Q2.D = D- 
RR'^ D; adjusted covariates E , = Ei-RR^Ei, i = 1,2,3, 
Second, covariates were regressed out from adjusted 
phenotypes using the following models: 

_ 3 _ _ 3 _ 

Q 1 = ^0 + ^^iE / + Q 2 = ^0 + ^biE + 

i=l i=l 
_ 3 _ 

D = Co + ^CiE i + d 

i=l 

(1) 

where q\yq2 and d are regression residuals. These resid- 
uals were tested for an association with adjusted ge- 
notype G. Statistical significance for each method was 
assessed using 1000 permutations. The power was calcu- 
lated as a proportion of phenotype replicates significant 
at the type-1 error rate of 0.05. 

GAW17 analysis results: comparison with the score test 

We considered three partitionings for our method: MAF 
partitioning (two groups: variants with MAF above and 
below 1%), functional partitioning (two groups: non- 
synonymous variants; synonymous and unknown vari- 
ants), and combined partitioning (four groups defined by 
MAF and functionality). To assess the empirical type-1 



error rate of our method and those of the score test we 
ran the analysis with randomly permuted residuals qi,q2 
and d from (1). Additional file 3 shows the empirical 
type-1 error rates for Qi and Q2 traits (Panel 1) and di- 
chotomous phenotype (Panel 2). It should be noted that 
the estimate of type-1 error rate is distributed as an ob- 
served probability of success for a binomial random vari- 
able with the sample size of 200 (number of phenotype 
replicates) and the probability of success 0.05 (theoret- 
ical type-1 error rate). The double-sided 99% confidence 
interval for the estimate of type-1 error rate is 0.015- 
0.095. As can be seen from Additional file 3, the type-1 
error was well controlled. 

Panel 1 of Figure 3 shows the results for our method and 
the score test on genes that impacted Qi and Q2 (genes 
from ARNT to VEGFA were tested on association with Q^, 
genes from BCHE to VWF - with Q2). Panel 2 of Figure 3 
depicts the results of the analysis for the dichotomous trait. 
As can be seen, the score test was significantly more power- 
ful for KDR, BCHE and S0S2 genes (dichotomous 
phenotype), whereas for at least one partitioning the pro- 
posed method was substantially more powerful for ARNT, 
ELAVL4, HIFIA genes (Qi trait); VNNl gene (Q2 trait); 
FLTl and PIK3C3 genes (dichotomous phenotype). Table 1 
shows our suggested reasons for the difference in perform- 
ance between the score test and our proposed method for 
the genes listed above. As can be seen, for all the genes for 
which our method outperformed the score test, some of 
the considered partitionings contained a group consisting 
of only associated variants. This is likely to have resulted in 
our method achieving higher power compared with those 
of the score test. It should be noted that for ELAVL4 gene 
the major association signal is borne by common variants, 
although according to the GAW17 phenotype model 
this gene contains only rare causal variants. To confirm 
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Figure 3 Some results of GAW17 analysis. Panel 1: Comparison of the proposed method (with different partitionings) with the score test for 
Qi and Q2 causal genes and respective quantitative traits (d causal genes are those from ARNT to VEGFA, O2 causal genes are those from BCHE 
to VWF); Panel 2: Performance of the proposed method (with different partitionings) and the score test for the causal genes and a dichotomous 
trait; Panel 3: Comparison of the proposed method (MAF partitioning) with other methods on d causal genes. 



this hypothesis, we performed an association analysis of 
ELAVL4 common SNPs with Qi trait using the score test. 
The power to identify an association was 79% after adjust- 
ing for population stratification and confounders. For the 
genes KDR, BCHE and S0S2 we did not observe any clear 
separation of associated variants using any of the consid- 
ered partitionings. 

GAW17 analysis results: comparison with other tests 

We contrasted the performance of our method (MAF 
partitioning) with those of WSCS [22], VT [18], SSUw 



[23] and SKAT-O with linear kernel and beta weights 
[24]. Also, we included some of the published results ob- 
tained by the participants of the GAW17 workshop: the 
performance of Multivariate Distance Matrix Regression 
(MDMR) and Mantel tests on Q2 causal genes [28], the 
performance of aggregated U-test (AggregateU) and 
CMC method (QuTie) on Qi causal genes [29]. It should 
be noted that the adjustment for covariates and population 
stratification for SKAT-O test was done via the function 
"SKAT_Null_Moder' of the R (http://cran.r-project.org) 
package "SKAT". Also, in our implementation of the WSCS 
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Table 1 Suggested reasons for the difference in power between the score test and our approach for some genes 



Genes 



Partitioning 



Suggested reason for the difference in performance 



Genes for which our method outperformed the score test 

All common non-synonymous SNPs are causal. 
Association of the three common non-causal SNPs with 
The only common SNP is causal. 
The only common SNP is causal. 
All common non-synonymous SNPs are causal. 
All common non-synonymous SNPs are causal. 
Genes for which the score test outperformed our method 

No clear separation of associated variants from neutral ones using any of the considered partitionings. 



ARNT 
ELAVL4 
HIFIA 
VNNl 
FLTl 
PIK3C3 

KDR, BCHE, S0S2 



MAF and functionality 
MAF 
MAF 
MAF 

MAF and functionality 
MAF and functionality 



The table contains only those genes for which there was a significant difference in performance between the score test and our approach for any of the 
suggested partitionings. 

*for details, see the subsection "GAW17 analysis results: comparison with the score test". 



method rare variant weights were proportional to the abso- 
lute value of correlation between those rare variants and 
phenotype, since the original weights described by Dai et al 
[22] were not applicable due to the adjusted phenotype 
being quantitative. Additional file 4 shows the empirical 
type-1 error estimates for WSCS, VT, SSUw and SKAT-O 
tests. Given that the 99% confidence interval for the empir- 
ical estimate of type-1 error rate is 0.015-0.095 as described 
previously, there was no evidence for type-1 error inflation. 

Panel 3 of Figure 3 depicts the analysis results for Qi 
causal genes. As can be seen, our method was among 
the top three most powerful approaches for ANRT, 
ELAVL4 and HIFIA genes. Panel 1 of Additional file 5 
shows the results for Q2 causal genes. Our method was 
the top performer for VNN3 gene, and was among the 
top three performing approaches for LPL and VNNl 
genes. Panel 2 of Additional file 5 depicts the results of 
an association analysis of all causal genes with dichot- 
omous phenotype. As can be seen, SKAT-O achieved 
the most notable power gains for PIK3C3 and PTK2B 
genes. This emphasizes that SKAT-O may significantly 
outperform other tests under some phenotype models. 
However, this method also may significantly underper- 
form other tests for some phenotype models, for ex- 
ample, ELAVL4 gene with Qi trait and VNNl gene with 
Q2 trait. Our proposed method was among the top three 
performing methods for S0S2, PTK2B, PRKCA genes 
and some other genes for which the power of all the 
methods was low. 

Discussion 

In this article we have described a methodology that in- 
corporates prior information into a region-based score 
test with the purpose of improving power to identify an 
association. Prior information, such as observed MAF, 
functional annotation, predicted deleteriousness of non- 
synonymous variants, may be used to partition variants 
within a region of interest. Then, this partitioning is 



utilized to construct a test statistic whose distribution 
under the null hypothesis has lower degrees of freedom 
compared with those of the score test. Based on the 
theoretical power comparison, population genetics simu- 
lations and GAW17 sequencing data analysis, we have 
shown that under some scenarios our method may per- 
form as well as or outperform other methods. 

Our suggested partitioning is splitting by functionality 
and if rare variants are present - by MAF with an arbi- 
trary threshold, e.g., 1%. One of the major justifications 
for considering MAF in a partitioning design is that 
MAF may distinguish between different evolutionary 
forces acting on causal variants. Evolutionary theory pre- 
dicts that variants that confer susceptibility to a disease 
which in turn reduces fitness are expected to have 
low MAF due to purifying selection. Empirically, using 
whole-exome sequencing data, it was found that non- 
synonymous substitutions are more significantly skewed 
towards low frequencies compared with synonymous 
variants. This finding "almost certainly reflects the oper- 
ation of purifying selection" [30] acting on many genes 
across genome. Thus, if a causal gene is under strong 
purifying selection incorporating MAF into partitioning 
design is likely to be beneficial. On the other hand, a 
causal variant could have risen to higher frequency due 
to other forces such as balancing selection, mutation- 
selection balance, antagonistic pleiotropy, etc. [31]. Thus, 
for example, if a causal variant has been under a strong 
balancing selection it is likely to be common; and incorpor- 
ating MAF into partitioning design may lead to power im- 
provement. Partitioning by functionality may be beneficial 
for gene-based analysis in cases when an association signal 
comes from one or several functional groups of variants. 
For example, if highly deleterious non-synonymous variants 
within exons of a gene are causal, then partitioning by 
functionality is likely to improve power. Although the mis- 
classification rate for prediction tools may be high (e.g., 
PolyPhen-2 achieved 92% power to detect truly damaging 
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variants at 20% type-1 error rate over HumDiv data [32]), 
our method may still benefit from grouping variants by 
functional significance, since this is an effective way to give 
more emphasis on variants that are more likely to be causal 
[33]. The benefit of using frinctional information was dem- 
onstrated in several simulations and candidate gene se- 
quencing studies [18,34,35]. 

There are a few limitations of the proposed methodology. 
First, it is unknown in advance which prior information is 
relevant for a given genomic region. Prior information that 
does not help to uncover groups of associated variants is 
likely to have negative impact on statistical power. Second, 
the test statistic (3) is, in general, not invariant with respect 
to the ordering of SNPs. This can be seen, for example, when 
the covariance matrix Vis invariant with respect to permuta- 
tion of elements of the score vector U (e.g., all the cross-SNP 
covariances are the same, and diagonal variances are equal). 
So, for any SNP ordering the Choleslcy decomposition A is 
the same; hence, the test statistic is, in general, dependent on 
ordering. To clarify the ambiguity the ordering of SNPs ac- 
cording to a position on a chromosome may be assumed. 

Conclusions 

A novel statistical approach that incorporates prior in- 
formation in a region-based score test with the purpose 
of improving power has been proposed. Theoretical 
power comparison, population genetics simulations and 
the results of the GAW17 data set analysis showed that 
under some scenarios our method may perform as well 
as or outperform other methods. 

Methods 

Consider a rare variants association study of a genomic 
region with a dichotomous or quantitative trait. Let us 
introduce the following notations: genotype matrix G = 
{gyiiy n = l, / = 1, ...,1} coded as minor allele counts, 
where n and / are indices for individuals and variants re- 
spectively; Nxl vector of genotypes for the Ith variant 
gi = {gnh ^ = 1, A/}; mean of genotype for the /th variant 
g^; phenotype vector Y= {y^, n = l, A/}, mean of pheno- 
type F. Consider the generalized linear model g{Ey) = bo 
+ Gbf where b = {bi, bi) is a vector of regression coef- 
ficients, and ^ is a monotone link function. The score 
test statistic used to test an association of phenotype 
with genotype is as follows [36]: 

To = u^y-^u, (2) 

where u= < > is the vector of scores, 

V = Cov{G)^{y^-Yf is L xL the score covariance matrix as- 
sumed to be non-singular, and Cov(G) = {Cov(gi,gi') }^^/^^ 
is the LxL genotype covariance matrix. Under the null 



hypothesis (none of the variants is associated with pheno- 
type) To asymptotically follows chi-squared distribution 
with rank{V) degrees of freedom (d.f.). Under the alterna- 
tive hypothesis (at least one variant is associated with 
phenotype) the asymptotic distribution of (2) can be ap- 
proximated by a non-central chi-squared distribution. 

Following is the description of our proposed method. 
First, let us transform the coordinates of score vector 
U to be asymptotically independent under the null 
hypothesis. If we denote C= (A^)"^, where matrix A is a 
Cholesky decomposition of V {V=A^A), the vector S = 
CUy under the null hypothesis, is asymptotically distrib- 
uted as a standard multivariate normal random vector 
(since Cov{CU) = CCov{U)C^ = {A^Y VA" ' = {A^Y \A^A) 
A" ^ = 7). It should be noted that under the alternative hy- 
pothesis S is approximately distributed as a multivariate 
normal random vector with nonzero mean and unit covari- 
ance matrix under the assumption of V being a reasonable 
approximation for the covariance matrix of U [37]. In gen- 
eral, any other decomposition of the covariance matrix V of 
the form V=A^A can be used. We apply Cholesl<y decom- 
position because it is fast to compute even for big covari- 
ance matrices. 

Next, let us describe group statistics. Consider parti- 
tioning of L SNPs within a region into K disjoint groups 
Gy^, /:= 1, ...,/<r of size L/^, /:= 1, ...,/<r. This partitioning is 
done using prior information, for example: by observed 
MAF (below and above some arbitrary frequency thresh- 
old); by functional annotation (non-synonymous and 
synonymous for exon sequencing studies; exonic, in- 
tronic, 3'UTR, 5'UTR and intergenic for gene-based 
studies); by functional significance for non-synonymous 
SNPS (benign, probably and possibly damaging from 
SIFT or PolyPhen output) etc. For each group let us de- 
fine tk = k = 1, ...,/<r, where 5/, / = 1, L is the /th 

coordinate of the vector S, It should be noted that parti- 
tioning based on prior to 5', since S = {A^Y^ ^ is the 
vector in a new basis, which is the set of column vec- 
tors of the matrix A^ (old basis is a standard basis). 
Since the vector S is multivariate normal with unit co- 
variance matrix, t/^J< = 1, are asymptotically inde- 
pendent. Under the null hypothesis t/^ asymptotically 
follows central chi-squared distribution with L/^ degrees 
of freedom, whereas under the alternative hypothesis at 
least one of the ty^, /c= 1, ...,/<r is distributed as a non- 
central chi-squared random variable. Let us denote as Pi 
and PJ^ cumulative distribution function (CDF) of cen- 
tral chi-squared random variable with / d.f. and its in- 
verse respectively. Group statistics H/^ are defined as 
follows: 

H,=P-,'{PlM))- (3) 
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It should be noted that under the null hypothesis H/^ 
follows a central chi-squared distribution with 1 d.f. 
Finally, the test statistic of our proposed method is: 



k=l 



(4) 



Under the null hypothesis Ti is distributed as a chi- 
squared random variable with K d.f. Large value of Ti 
indicates the deviation from the null hypothesis. Trans- 
formation (3) is used to decrease the degrees of freedom 
of the final test, as some of the groups G/^, /:= 1, ...,/<r 
may contain only neutral variants. If /<th group does not 
contain associated SNPs, group G/^ contributes only 1 d.f. to 
the final test, whereas it adds L/^ d.f. to the score test. Thus, 
under "informative" partitioning our proposed method may 
gain power compared with the score test. 

There are few notes that should be taken into account 
when applying our method. First, the theoretical ap- 
proximation for distribution of t/^ may not hold under 
moderate sample sizes. Hence, in all our simulations and 
real data application we used the empirical CDF of t/^, 
denoted as i>, obtained via 1000 permutations under the 
null hypothesis. Instead of PiiX^k) in (3) we calculated a 
Gaussian kernel CDF estimate F/Xtk) with a multi-stage 
plug-in bandwidth of Polansky and Baker [38]. R (http:// 
cran.r-project.org) package 'kerdiest' contains all the ne- 
cessary algorithms. Second, the covariance matrix V may 
happen to be computationally or exactly singular when, 
for example, considering a region with high LD between 
common SNPs. In general, to avoid singularity issues we 
apply Monroe-Penrose pseudoinverse in (2) instead of 
This pseudoinverse is uniquely defined and equals the 
standard inverse matrix when V is non-singular. From the 
same considerations instead of usual Cholesky factoring 
we used generalized Cholesl<y decomposition, which is 
well defined for square symmetric positive semi-definite 
matrices. If Vis non-singular, the generalized Cholesky de- 
composition equals the usual Cholesky factoring. Also, it 
should be noted that in general our test statistic is not in- 
variant with respect to the ordering of variants within a re- 
gion. To clarif)^ the ambiguity we assume the ordering of 
SNPs according to a position on a chromosome (see the 
"Discussion" section). 

Theoretical power calculations 

To evaluate our method we compared its theoretical 
power with those of the score test. Denote the non- 
centrality parameter (NCP) of the test statistic (2) under 
the alternative hypothesis as r (the connection between r 
and effect size is provided in the Additional files 6 and 
7). Also, we will use the notations Pi^r and Pj) for the 
CDF of non-central chi-squared random variable with / 



d.f. and NCP r and its inverse respectively. The power of 
the score test with type-1 error rate a is: 



l-^I,r(^l-,,,l), 



(5) 



where qi-a,L is 1 - a quantile of the chi-squared distribu- 
tion with L d.f. Let us assume that all the causal SNPs 
are uncorrected with other SNPs within a region, and 
that they are in one of the K groups (without loss of 
generality, let it be the first group k=l). Then, the 
power of the proposed test is: 



i-p{H,+xl-i<qi-a,K). 



(6) 



since ^/z^ follows a central chi-squared distribution 

with K- 1 d.f., because all associated SNPs are in the 
first group. The probability in (6) can be computed using 
the convolution of two CDFs as follows: 



-1 < ^l-a,/c) 



P(Hi< q^_^j^-x)dPK-i{x[ 



(7) 



Under the alternative hypothesis non-centrality par- 
ameter of To and those of ti are equal, since both sta- 
tistics include all the associated SNPs within a region. 

Thus, the distribution of Hi is P^^ (^Pl^ {xLi,r^^> 

integrand in the equation (7) can be calculated as P 

(^1 < =^ii,r(^Zj(^i(^i-.,/c-^)))- This al- 

lows us to compute the theoretical power of our 
method. 



Population genetics simulations 

Boyko et al. [39] found a simple two-epoch expansion 
model to be one of the best-fit models for population gen- 
etics simulations of African- American genotype. Also, the 
authors found strong evidence of selective effects acting on 
new amino acid replacing mutations and inferred the 
distribution of those selective effects. The scaled population 
selection coefficient (defined as a selection coefficient 
multiplied by twice the effective population size) was as- 
sumed to follow a negative gamma distribution with the 
following parameters: shape 0.184 and scale 8200. Mutation 
rate was set at 1.8 x 10"^ per nucleotide per generation. 
The selective disadvantage was additive. Using the forward 
simulator SFS_CODE (http://sfscode.sourceforge.net) with 
the parameters described above, we generated 100 haplo- 
type pools for a 3 kb coding genomic region. Each haplo- 
type pool contained 4000 haplotypes. In our simulations we 
considered only non-synonymous variants. To generate a 
data replicate we first selected a haplotype pool at random. 
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Within that pool we sampled a pair of haplotypes at ran- 
dom and added those haplotypes to form a multi-site geno- 
type of an "individual". Dichotomous phenotype was 
assigned based on this multi-site genotype using a linear lo- 
gistic model through which we controlled the odds ratios 
of causal variants. The probability of a disease conditional 
on the wild type genotype was set at 1% for all phenotype 
models. 

We considered four phenotype scenarios. For the first 
model called "Rare" randomly chosen 50% of rare vari- 
ants (defined as those with MAF<1%) within a haplotype 
pool were assigned to be causal with uniform odds ratio 
of 3. For the "Low Frequency" scenario one randomly 
chosen low frequency SNP (MAF between 1% and 5% 
in a haplotype pool) was causal with odds ratio of 2.5. 
If there was no SNP with MAF between 1% and 5% in a 
haplotype pool, we selected a SNP with the lowest MAF 
above 5%. For the "Common" scenario one common 
SNP with MAF>5% in a haplotype pool was causal and 
had the odds ratio of 1.5. Finally, the "Interaction" sce- 
nario models the hypothesized interaction of common 
and rare variants in RET gene associated with Hirsch- 
sprung's disease [40,41]. The simulation framework for 
this scenario was previously described by Liu and Leal 
[42]. Briefly, 50% of rare variants were assigned to be 
causal. Each causal rare minor allele increased the odds 
of a disease by 6 times if and only if it was present on 
the same haplotype as a minor allele of a common SNP 
randomly chosen beforehand. 

Sample sizes were the following: 500 cases and 500 
controls for "Rare", "Low frequency" and "Common" 
scenarios; 1000 cases and 1000 controls for "Interaction" 
model. In total 1000 data replicates were generated for 
each scenario. Power was estimated as a proportion of 
data replicates significant at a fixed type-1 error of 0.05. 
For all the statistical tests 1000 permutations were ap- 
plied to estimate a significance level. 
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